Proximity Rank Join Based on Cosine Similarity

Author

  • Marco Tagliasacchi
Abstract

Proximity rank join is the problem of finding the top-K combinations of objects, drawn from different services, that have the highest aggregate score, where each object carries both a score and a real-valued feature vector. The proximity of the objects, i.e. the geometry of the feature space, plays a distinctive role in computing the overall score of a combination. If the notion of proximity is replaced with that of similarity, the proximity rank join problem captures many interesting scenarios, e.g. finding similar documents in different data collections given a set of keywords, or requesting similar images from different archives given a sample image. For this reason, the main goal of this thesis is to incorporate the cosine similarity measure into the proximity rank join algorithm. This incorporation, however, yields an optimization problem that is computationally demanding. We therefore solve an approximated version of the original optimization problem, under an assumption that we have found to be consistent with the concept of top-K combinations. The solution of the approximated problem has two desirable properties. First, as the dimension of the feature space increases (from 2D to 3D), the deviation between the true and the approximated aggregate score, normalized by the maximum possible aggregate score, decreases. Second, like the true solution, it avoids the unrealistic trivial solution that arises in other approximated variants of the original optimization problem.
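
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the thesis: the abstract does not give the aggregation function, so `aggregate_score` and its weights are hypothetical, and the sketch simply enumerates every combination instead of using a rank join strategy or the approximated optimization problem. Each object is assumed to be a `(score, feature_vector)` pair.

```python
import heapq
import itertools
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aggregate_score(combination, query, w_score=1.0, w_query=1.0, w_mutual=1.0):
    """Hypothetical aggregate (an assumption, not taken from the thesis):
    weighted sum of object scores, similarity of each object to the query,
    and similarity between every pair of objects in the combination."""
    total = 0.0
    for score, vec in combination:
        total += w_score * score + w_query * cosine_similarity(vec, query)
    for (_, u), (_, v) in itertools.combinations(combination, 2):
        total += w_mutual * cosine_similarity(u, v)
    return total


def top_k_combinations(services, query, k):
    """Exhaustively score the cross product of the services and keep the k best.

    `services` is a list of lists; each inner list holds the objects
    returned by one service. Returns (aggregate, combination) pairs,
    best first.
    """
    scored = (
        (aggregate_score(combo, query), combo)
        for combo in itertools.product(*services)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

For example, `top_k_combinations([service_a, service_b], query, k=2)` returns the two best pairs formed by one object from each service. The exhaustive enumeration grows with the product of the service sizes, which is the cost a rank join algorithm tries to avoid by reading only a prefix of each ranked input.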

Similar resources

An Efficient Similarity Join Algorithm with Cosine Similarity Predicate

Given a large collection of objects, finding all pairs of similar objects, namely similarity join, is widely used to solve various problems in many application domains. The computation time of similarity join is a critical issue, since similarity join requires computing similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similari...

Similarity Joins of Text with Incomplete Information Formats

Similarity join over text is important in text retrieval and query. Due to incomplete formats of information representation, such as abbreviations and short words, similarity joins should address an asymmetric feature: these incomplete formats may contain only partial information of their original representation. Current approaches, including cosine similarity with q-grams, can hardly dea...

A New Similarity Measure Based on Item Proximity and Closeness for Collaborative Filtering Recommendation

Recommender systems utilize information retrieval and machine learning techniques for filtering information and can predict whether a user would like an unseen item. User similarity measurement plays an important role in collaborative filtering based recommender systems. In order to improve accuracy of traditional user based collaborative filtering techniques under new user cold-start problem a...

Simple and Efficient Algorithm for Approximate Dictionary Matching

This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We propose this algorithm, called CPMerge, for the τ-overlap join of inverted lists. First we show that this task is solvable exactly by a τ-overlap join. Given inverted lists retrieved for a query, the algorithm coll...

INFORMATION MEASURES BASED TOPSIS METHOD FOR MULTICRITERIA DECISION MAKING PROBLEM IN INTUITIONISTIC FUZZY ENVIRONMENT

In fuzzy set theory, information measures play a paramount role in several areas such as decision making, pattern recognition, etc. In this paper, a similarity measure based on the cosine function and entropy measures based on the logarithmic function for IFSs are proposed. Comparisons of the proposed similarity and entropy measures with the existing ones are listed. Numerical results limpidly betoken th...

Publication date: 2010